Ways to Improve N-gram Language Models for OCR and Speech Recognition of Slavic Languages

Authors

V. Taranukha

Taras Shevchenko

Abstract

This paper investigates the problems of n-gram language models in OCR and speech recognition for Slavic languages and proposes methods applicable to most of them. Two approaches are tested: filtering the n-gram model and alternative ways of smoothing. Filtering relies on heuristics based on word frequencies and morphological features. Smoothing uses classes based on morphological features in combination with a new discounting formula, and it can also be combined with inner filtering. Numerical experiments for Ukrainian show that both approaches produce interesting results. Smoothing is the more promising of the two, but it is also more complex: to outperform standard smoothing techniques, it requires further work on developing proper classes from morphological information.
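As a rough illustration of the two ideas named in the abstract, the sketch below shows a frequency-threshold filter over bigram counts and a standard class-based bigram estimate in which morphological classes stand in for words. It is not the paper's method: the paper's heuristics and its new discounting formula are not given here and are not reproduced. The function names, the `min_count` threshold, the toy transliterated tokens, and the coarse `NOUN`/`VERB` classes are all assumptions made for the example.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))

def filter_bigrams(counts, min_count=2):
    """Frequency heuristic (assumed): drop bigrams seen fewer than min_count times."""
    return Counter({bg: c for bg, c in counts.items() if c >= min_count})

def class_bigram_prob(prev, word, tokens, word_class):
    """Class-based estimate: P(word | prev) ~ P(c_word | c_prev) * P(word | c_word)."""
    classes = [word_class[t] for t in tokens]
    class_bigrams = Counter(zip(classes, classes[1:]))
    class_history = Counter(classes[:-1])  # classes that have a successor
    class_unigrams = Counter(classes)
    word_unigrams = Counter(tokens)
    c_prev, c_word = word_class[prev], word_class[word]
    if class_history[c_prev] == 0:
        return 0.0
    p_class = class_bigrams[(c_prev, c_word)] / class_history[c_prev]
    p_word = word_unigrams[word] / class_unigrams[c_word]
    return p_class * p_word

# Toy corpus (hypothetical): transliterated tokens with coarse morphological classes.
tokens = ["kit", "spyt", "kit", "spyt", "pes", "spyt", "kit", "yist"]
word_class = {"kit": "NOUN", "pes": "NOUN", "spyt": "VERB", "yist": "VERB"}

kept = filter_bigrams(bigram_counts(tokens), min_count=2)
print(sorted(kept.items()))

# The bigram ("pes", "yist") never occurs, yet the class model assigns it mass.
print(class_bigram_prob("pes", "yist", tokens, word_class))
```

The class-based estimate gives a nonzero probability to a word pair that was never observed, which is the general motivation for class-based smoothing in highly inflected languages, where many valid word-form combinations are missing from the training data.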

Similar resources

Cache-Augmented Latent Topic Language Models for Speech Retrieval

We aim to improve speech retrieval performance by augmenting traditional N-gram language models with different types of topic context. We present a latent topic model framework that treats documents as arising from an underlying topic sequence combined with a cache-based repetition model. We analyze our proposed model both for its ability to capture word repetition via the cache and for its sui...

Comparison of Spectral Methods for Spoken Language Identification

Automatic spoken language identification is the task of identifying a language from the speech signal. Language identification systems can be divided into two categories: spectral-based methods and phonetic-based methods. In the former, short-time characteristics of the speech spectrum are extracted as a multi-dimensional vector. The statistical model of these features is then obtained for each language. The ...

Lemmatized Latent Semantic Model for Language Model Adaptation of Highly Inflected Languages

We present a method to adapt statistical N-gram models for large vocabulary continuous speech recognition of highly inflected languages. The method combines morphological analysis, latent semantic analysis (LSA) and fast marginal adaptation for building topic-adapted trigram models, based on a background language model and very short adaptation texts. We compare words, lemmas and morphemes as b...

Speech Recognition on English-Mandarin Code-Switching Data using Factored Language Models - with Part-of-Speech Tags, Language ID and Code-Switch Point Probability as Factors

Code-switching (CS) is defined as "the alternate use of two or more languages in the same utterance or conversation" [1]. CS is a widespread phenomenon in multilingual communities, where multiple languages are concurrently used in a conversation. For automatic speech recognition (ASR), intra-sentential code-switching in particular poses an interesting challenge due to the multilingual context for la...

Jezikovno neodvisno modeliranje pregibnega jezika (Language-Independent Modelling of an Inflected Language)

This article concerns statistical language modelling of the Slovenian language for automatic speech recognition. We investigate various techniques for overcoming the difficulties in modelling highly inflected languages. Slavic languages are particularly challenging, and Slovenian is one of them. Two main problems arise when modelling Slovenian in comparison to English. Th...

Publication date: 2014
